{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 第14章 聚类方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 习题14.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "&emsp;&emsp;试写出分裂聚类算法，自上而下地对数据进行聚类，并给出其算法复杂度。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**\n",
    "\n",
    "1. 给出一般聚类方法概述\n",
    "2. 给出分裂聚类的定义\n",
    "3. 给出分裂聚类预先确定的要素\n",
    "4. 写出分裂聚类算法，并计算复杂度\n",
    "5. 自编程实现分裂聚类算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：** "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：一般聚类方法概述**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第14章的聚类方法介绍：\n",
    "\n",
    "> &emsp;&emsp;聚类是针对给定的样本，依据它们的特征的相似度或距离，将其归并到若干个“类”或“簇”的数据分析问题。一个类是给定样本集合的一个子集。直观上，相似的样本聚集在相同的类，不相似的样本分散在不同的类。常用的聚类算法包括层次聚类和$k$均值聚类。层次聚类又分为聚合（自下而上）和分裂（自上而下）两种方法。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：分裂聚类的定义**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第14章的分裂聚类介绍：\n",
    "\n",
    "> &emsp;&emsp;分裂法开始将所有样本分到一个类，之后将已有类中相距最远的样本分到两个新的类，重复此操作直到满足停止条件，得到层次化的类别。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：分裂聚类预先确定的要素**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;分裂聚类的具体过程如下：对于给定的样本集合，开始将所有样本分到一个类，然后按照一定分裂方法，例如类中距离最大，将最满足规则条件的样本分到两个新的类；如此反复进行，每次减少剩余未分类的样本，直到满足停止条件，如所有样本都分到了对应的类中。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;由此可知，分裂聚类需要预先确定下面三个要素：  \n",
    "&emsp;&emsp;（1）选择要分裂的类方法；  \n",
    "&emsp;&emsp;（2）分裂类的方法；  \n",
    "&emsp;&emsp;（3）停止条件。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据这些要素的不同组合，就可以构成不同的分裂聚类方法。选择要分裂的类的方法可以是选择最大直径的类、样本最多的类、最大均值的类、最大SSE的类。分裂类的方法可以是使用贪心算法选择新类中心、K-means算法、Ward方法。停止条件可以是每个类包含一个样本、类的个数达到阈值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第4步：写出分裂聚类算法，并计算复杂度**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**分裂聚类算法**  \n",
    "输入：$n$个样本组成的样本集合及设定的样本类别数$k$；  \n",
    "输出：满足设定的样本类别数的样本集合的一个层次化聚类。  \n",
    "（1）构造1个类，该类包含全部样本。  \n",
    "（2）计算$n$个样本两两之间的欧氏距离$\\{d_{ij}\\}$。  \n",
    "（3）分裂类中距离最大的两个样本，将其分到两个类，并设置为各自的类中心。  \n",
    "（4）计算未分裂的样本与目前各个类中心的距离，分裂类中距离最大的样本作为新的类中心，构造一个新类。  \n",
    "（5）如果类的个数满足设定的样本类别数$k$，则继续执行第（6）步，否则回到第（4）步继续分裂样本。  \n",
    "（6）根据当前的类中心，按照最近邻的原则对整个数据集进行分类。  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**计算复杂度：**  \n",
    "1. 时间复杂度为$O(k n^2m)$：整个算法计算过程中需要计算两两样本之间的复杂度，设$m$是样本的维度，总体样本为$n$个，则时间复杂度为$O(n^2 m)$；设定的样本类别数为$k$，则需要重复$k$次，故时间复杂度为$O(k n^2 m)$。  \n",
    "2. 空间复杂度为$O(n^2)$：需要一个$n \\times n$的矩阵保存两两样本间的距离。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第5步：自编程实现分裂聚类算法**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "class DivisiveClustering:\n",
    "    def __init__(self, num_class):\n",
    "        # 聚类类别个数\n",
    "        self.num_class = num_class\n",
    "        # 聚类数据集\n",
    "        self.cluster_data = []\n",
    "        if num_class > 1:\n",
    "            self.cluster_data = [[] for _ in range(num_class)]\n",
    "\n",
    "    def fit(self, data):\n",
    "        \"\"\"\n",
    "        :param data: 数据集\n",
    "        \"\"\"\n",
    "        num_sample = data.shape[0]\n",
    "\n",
    "        if self.num_class == 1:\n",
    "            # 如果只设定了一类，将所有数据放入到该类中\n",
    "            for d in data:\n",
    "                self.cluster_data.append(d)\n",
    "        else:\n",
    "            # (1) 构造1个类，该类包含全部样本\n",
    "            # 初始化类中心\n",
    "            class_center = []\n",
    "\n",
    "            # (2) 计算n个样本两两之间的欧氏距离\n",
    "            distance = np.zeros((num_sample, num_sample))\n",
    "            for i in range(num_sample):\n",
    "                for j in range(i + 1, num_sample):\n",
    "                    distance[j, i] = distance[i, j] = np.linalg.norm(data[i, :] - data[j, :], \n",
    "                                                                     ord=2)\n",
    "\n",
    "            # (3) 分裂距离最大的两个样本，并设置为各自的类中心\n",
    "            index = np.where(np.max(distance) == distance)\n",
    "            class_1 = index[1][0]\n",
    "            class_2 = index[1][1]\n",
    "            # 记录已经分裂完成的样本\n",
    "            finished_data = [class_1, class_2]\n",
    "            class_center.append(data[class_1, :])\n",
    "            class_center.append(data[class_2, :])\n",
    "\n",
    "            num_class_temp = 2\n",
    "            # (5) 判断类的个数是否满足设定的样本类别数\n",
    "            while num_class_temp != self.num_class:\n",
    "                # (4.1) 计算未分裂的样本与目前各个类中心的距离\n",
    "                data2class_distance = np.zeros((num_sample, 1))\n",
    "                for i in range(num_sample):\n",
    "                    # 计算样本到各类中心的距离总和\n",
    "                    data2class_sum = 0\n",
    "                    for j, center_data in enumerate(class_center):\n",
    "                        if i not in finished_data:\n",
    "                            data2class_sum += np.linalg.norm(data[i, :] - center_data)\n",
    "                    data2class_distance[i] = data2class_sum\n",
    "\n",
    "                # (4.2) 分裂类间距离最大的样本作为新的类中心，构造一个新类\n",
    "                class_new_index = np.argmax(data2class_distance)\n",
    "                num_class_temp += 1\n",
    "                finished_data.append(class_new_index)\n",
    "                # 添加到类中心集合中\n",
    "                class_center.append(data[class_new_index, :])\n",
    "\n",
    "            # 根据当前的类中心，按照最近邻的原则对整个数据集进行分类\n",
    "            for i in range(num_sample):\n",
    "                data2class_distance = []\n",
    "                for j, center_data in enumerate(class_center):\n",
    "                    # 计算每个样本到类中心的距离\n",
    "                    data2class_distance.append(np.linalg.norm(data[i, :] - center_data))\n",
    "\n",
    "                # 将样本划分到最近的中心的类中\n",
    "                label = np.argmin(data2class_distance)\n",
    "                self.cluster_data[label].append(data[i, :])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "分类数: 2\n",
      "class_0: [array([0, 0]), array([1, 0]), array([5, 0])]\n",
      "class_1: [array([0, 2]), array([5, 2])]\n"
     ]
    }
   ],
   "source": [
    "# 使用书中例14.2的样本数据集\n",
    "dataset = np.array([[0, 2],\n",
    "                    [0, 0],\n",
    "                    [1, 0],\n",
    "                    [5, 0],\n",
    "                    [5, 2]])\n",
    "\n",
    "num_class = 2\n",
    "divi_cluster = DivisiveClustering(num_class=num_class)\n",
    "divi_cluster.fit(dataset)\n",
    "print(\"分类数:\", num_class)\n",
    "for i in range(num_class):\n",
    "    print(f\"class_{i}:\", divi_cluster.cluster_data[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题14.2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;证明类或簇的四个定义中，第一个定义可推出其他三个定义。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**\n",
    "\n",
    "**解答思路：**  \n",
    "1. 给出定义14.5\n",
    "2. 给出定义14.6，并由定义14.5推导定义14.6\n",
    "3. 给出定义14.7，并由定义14.5推导定义14.7\n",
    "4. 给出定义14.8，并由定义14.5推导定义14.8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：** "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：定义14.5**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第14.1.2节的定义14.5：\n",
    "\n",
    "> &emsp;&emsp;用$G$表示类或簇，用$x_i,x_j$表示类中的样本，$N_G$表示$G$中样本的个数，用$d_{ij}$表示样本$x_i$与样本$x_j$之间的距离。  \n",
    "> \n",
    "> **定义14.5** 设$T$为给定的正数，若集合$G$中任意两个样本$x_i, x_j$，有\n",
    "> $$ d_{ij} \\leqslant T $$\n",
    "> 则称$G$为一个类或簇。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：由定义14.5推导定义14.6**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**证明：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第14.1.2节的定义14.6：\n",
    "\n",
    "> **定义14.6** 设$T$为给定的正数，若对集合$G$中任意样本$x_i$，一定存在$G$中另一个样本$x_j$，使得\n",
    "> $$ d_{ij} \\leq T $$\n",
    "> 则称$G$为一个类或簇。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;由定义14.5可知，$T$大于等于集合$G$中两两样本间的距离最大值：\n",
    "$$\n",
    "D_G = \\max \\limits_{x_i, x_j \\in G} d_{ij} \\leqslant T\n",
    "$$\n",
    "&emsp;&emsp;则集合$G$中任取两个样本$x_i, x_j$，满足\n",
    "$$\n",
    "d_{ij} \\leqslant D_G \\leqslant T\n",
    "$$  \n",
    "&emsp;&emsp;定义14.6得证。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：由定义14.5推导定义14.7**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**证明：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第14.1.2节的定义14.7：\n",
    "\n",
    "> **定义14.7** 设$T$为给定的正数，若对集合$G$中任意一个样本$x_i$，$G$中的另一个样本$ x_j$满足\n",
    "> $$ \\frac{1}{n_G - 1} \\sum_{x_j \\in G} d_{ij} \\leqslant T $$\n",
    "> 其中$n_G$为G中样本的个数，则称G为一个类或簇。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;将集合$G$中除样本$x_i$，其余$n_G - 1$个样本与$x_i$的距离$d_{ij}$求和，有\n",
    "$$\n",
    "\\sum_{x_j \\in G} d_{ij} \\leqslant (n_G - 1) T\n",
    "$$\n",
    "&emsp;&emsp;所以\n",
    "$$\n",
    "\\frac{1}{n_G - 1} \\sum_{x_j \\in G} d_{ij} \\leqslant T\n",
    "$$\n",
    "&emsp;&emsp;定义14.7得证。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第4步：由定义14.5推导定义14.8**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**证明：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第14.1.2节的定义14.8：\n",
    "\n",
    "> **定义14.8** 设$T$和$V$为给定的两个正数，如果集合$G$中任意两个样本$x_i, x_j$的距离$d_{ij}$满足\n",
    "> $$ \\frac{1}{n_G(n_G - 1)}\\sum_{x_i \\in G}\\sum_{x_j \\in G} d_{ij} \\leqslant T \\\\ \n",
    "d_{ij} \\leqslant V $$\n",
    "> 则称$G$为一个类或簇。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据定义14.5可知：$d_{ij} \\leqslant T$，则不妨令$V=T$，对于$G$中任意两个样本$x_i, x_j$，则有$d_{ij} \\leqslant V, d_{ij} \\leqslant T$，结合定义14.7，则有\n",
    "$$\n",
    "\\sum_{x_i \\in G} \\sum_{x_j \\in G} d_{ij} \\leqslant n_G(n_G - 1) T\n",
    "$$ \n",
    "&emsp;&emsp;所以\n",
    "$$\n",
    "\\frac{1}{n_G(n_G - 1)} \\sum_{x_i \\in G}\\sum_{x_j \\in G} d_{ij} \\leqslant T \\\\ \n",
    "d_{ij} \\leqslant V\n",
    "$$\n",
    "&emsp;&emsp;定义14.8得证。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题14.3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;证明式(14.21)成立，即$k$均值的可能解的个数是指数级的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：** \n",
    "\n",
    "1. 给出书中公式（14.21）\n",
    "2. 给出可区分球到不可区分盒子的分配策略，得到指数的生成函数\n",
    "3. 利用指数生成函数进行二项式展开，推导并证明公式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：书中公式（14.21）**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第14.3.2节的描述：\n",
    "\n",
    "> &emsp;&emsp;$k$均值聚类就是求解最优化问题：\n",
    "> $$ \\begin{aligned}\n",
    "C^* \n",
    "&= \\arg \\min \\limits_{C} W(C) \\\\\n",
    "&= \\arg \\min \\limits_{C} \\sum_{l=1}^k \\sum_{C(i)=l} \\| x_i - \\bar{x}_l \\|^2 \n",
    "\\end{aligned} $$\n",
    "> &emsp;&emsp;相似的样本被聚到同类中，损失函数值最小，这个目标函数的最优化能达到聚类的目的。但是，这是一个组合优化问题，$n$个样本分到$k$类，所有可能分法的数目是：\n",
    "> $$ S(n,k) = \\frac{1}{k!} \\sum_{l=0}^k (-1)^{k-l} \\left( \\begin{array}{c} k \\\\ l \\end{array} \\right) l^n \\tag{14.21} $$\n",
    "> 这个数字是指数级的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：给出可区分球到不可区分盒子的分配策略**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;以下示例来自于《应用组合数学（第2版）》第216页5.5.3节：\n",
    "> &emsp;&emsp;$S(n, k)$的定义是把$n$个可区分球分配到$k$个可区分盒子里，其每一个盒子不为空时的方法数量。\n",
    "> 首先，考虑把$n$个可区分球放入标有标签$1, 2, \\cdots, k$的可区分盒子里，且每一个盒子都不为空的方法数量$T(n,k)$的问题，注意\n",
    "> \n",
    "> $$ T(n, k) = k!S(n, k) $$\n",
    "> \n",
    "> 确定把$n$个可区分球放入$k$个不为空的不可区分盒子的分配，然后再标记（排序）这些盒子，接下来计算$T(n, k)$。假设球$i$放入到盒子$C(i)$中，可以通过给出序列$C(1) C(2)\\cdots C(n)$编码把球放入可区分盒子里的分配，这是$k$集合$\\{1, 2, \\cdots, k\\}$的一个$n$排列，且$k$集合中的每一个标签$j$至少使用一次，因此，$T(n, k)$是这一排列的数量，且对于固定的$k$，$T(n, k)$的指数生成函数由下式给出：\n",
    "> \n",
    "> $$ H(x) = \\left( x + \\frac{x^2}{2!} + \\frac{x^3}{3!} + \\cdots \\right)^k = (e^x - 1)^k $$\n",
    "> \n",
    "> 其中$T(n, k)$是$H(x)$的展开式中$\\displaystyle \\frac{x^n}{n!}$的系数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据《应用组合数学（第2版）》第213页指数生成函数定义：\n",
    "> &emsp;&emsp;记$P(n, k)$是$n$集合的$k$排序的数量，固定$n$，则$P(n, k)$的普通生成函数由下式给出：\n",
    "> \n",
    "> $$ \\begin{aligned}\n",
    "G(x) \n",
    "&= P(n, 0) x^0 + P(n, 1) x^1 + P(n, 2) x^2 + \\cdots + P(n, n) x^n \\\\\n",
    "&= \\sum_{k=0}^{\\infty} \\frac{(n-k)!}{n!} x^k\n",
    "\\end{aligned} $$\n",
    "> \n",
    "> 结合从$n$集合中取$k$个元素的数量$C(n,k)$，则有下面的表达式：\n",
    "> \n",
    "> $$ C(n, 0) x^0 + C(n, 1) x^1 + C(n, 2) x^2 + \\cdots + C(n, n) x^n    $$\n",
    "> \n",
    "> 通过二项式展开可以化简为$(1 + x)^n$，则有：\n",
    "> \n",
    "> $$ P(n, r) = C(n, r)P(r, r) = C(n, r) r! $$\n",
    "> \n",
    "> 根据上式可重写为\n",
    "> \n",
    "> $$ P(n, 0) \\frac{x^0}{0!} + P(n, 1) \\frac{x^1}{1!} + P(n, 2) \\frac{x^2}{2!} + \\cdots + P(n, n) \\frac{x^n}{n!} = (1 + x)^n $$\n",
    "> \n",
    "> 其中$P(n, k)$是$(1 + x)^n$的展开式中$\\displaystyle \\frac{x^k}{k!}$的系数。  \n",
    "> &emsp;&emsp;如果$(a_k)$是任意序列，则一序列的指数生成函数是\n",
    "> \n",
    "> $$ H(x) = a_0 \\frac{x^0}{0!} + a_1 \\frac{x^1}{1!} + a_2 \\frac{x^2}{2!} + \\cdots + a_k \\frac{x^k}{k!} = \\sum_k a_k \\frac{x^k}{k!} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "说明：\n",
    "\n",
    "&emsp;&emsp;已知，$S(n, k)$是将$n$个样本分到$k$类中的数量，和上述描述的一致，则可考虑这个$k$类是不可区分的，即分类是有重复的，得到如下式子：\n",
    "$$\n",
    "T(n, k) = k!S(n, k)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;将第$i$个球放入第$C(i)$个盒子中，由于$k$个盒子是不可区分的，故需要进行$k$的指数，所以该指数生成函数是\n",
    "$$\n",
    "H(x) = \\left( x + \\frac{x^2}{2!} + \\frac{x^3}{3!} + \\cdots \\right)^k = (e^x - 1)^k\n",
    "$$\n",
    "其中$T(n, k)$是$H(x)$的展开式中$\\displaystyle \\frac{x^n}{n!}$的系数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：进行二项式展开推导公式**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;《应用组合数学（第2版）》第216页5.5.3节：\n",
    "> 根据二项式展开（定理2.7），有\n",
    "> \n",
    "> $$ H(x) = \\sum_{i=0}^k \\left( \\begin{array}{c} k \\\\ i \\end{array} \\right) (-1)^i e^{(k-i) x} $$\n",
    "> \n",
    "> 用$(k-i) x$取代$x$，得到\n",
    "> \n",
    "> $$ \\begin{aligned}\n",
    "H(x) \n",
    "&= \\sum_{i=0}^k \\left( \\begin{array}{c} k \\\\ i \\end{array} \\right) (-1)^i \\sum_{n=0}^{\\infty} \\frac{1}{n!} (k-i)^n x^n \\\\\n",
    "&= \\sum_{n=0}^{\\infty} \\frac{x^n}{n!} \\sum_{i=0}^k (-1)^i \\left( \\begin{array}{c} k \\\\ i \\end{array} \\right) (k-i)^n\n",
    "\\end{aligned} $$\n",
    "> \n",
    "> 由于$T(n, k)$是$H(x)$的展开式中$\\displaystyle \\frac{x^n}{n!}$的系数，可得\n",
    "> \n",
    "> $$ T(n, k) = \\sum_{i=0}^k (-1)^i \\left( \\begin{array}{c} k \\\\ i \\end{array} \\right) (k-i)^n $$\n",
    "> \n",
    "> 结合$T(n, k) = k!S(n, k)$，可得\n",
    "> \n",
    "> $$ S(n, k) = \\frac{1}{k!} \\sum_{i=0}^k (-1)^i \\left( \\begin{array}{c} k \\\\ i \\end{array} \\right) (k-i)^n $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;将上述公式的$k - i$替换成$l$，即$l = k - i$，可得\n",
    "$$\n",
    "S(n,k) = \\frac{1}{k!} \\sum_{l=0}^k (-1)^{k-l} \\left( \\begin{array}{c} k \\\\ l \\end{array} \\right) l^n\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题14.4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;比较$k$均值聚类与高斯混合模型加EM算法的异同。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**\n",
    "\n",
    "1. 给出$k$均值聚类算法\n",
    "2. 给出高斯混合模型参数估计的EM算法\n",
    "3. 写出两者的相同点\n",
    "4. 写出两者的不同点"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：$k$均值聚类算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第14.3.3节的$k$均值聚类算法：\n",
    "\n",
    "> &emsp;&emsp;$k$均值聚类的算法是一个迭代的过程，每次迭代包括两个步骤，首先选择$k$个类的中心，将样本逐个指派到与其最近的中心的类中，得到一个聚类结果；然后更新每个类的样本的均值，作为类的新的中心；重复以上步骤，直到收敛为止。具体过程如下：\n",
    "> &emsp;&emsp;首先，对于给定的中心值$(m_1,m_2, \\cdots ,m_k)$，求一个划分$C$，使得目标函数极小化：\n",
    "> $$ \\min_C \\sum_{l=1}^k \\sum_{C(i) = l} \\| x_i - m_l \\|^2 $$\n",
    "> 就是说在类中心确定的情况下，将每个样本分到一个类中，使样本和其所属类的中心之间的距离总和最小。求解结果，将每个样本指派到与其最近的中心$m_l$的类$G_l$中。  \n",
    "> &emsp;&emsp;然后，对给定的划分$C$，再求各个类的中心$(m_1, m_2, \\cdots , m_k)$，使得目标函数极小化：\n",
    "> $$ \\min \\limits_{m_1,...m_k} \\sum_{l=1}^k \\sum_{C(i)=l} \\| x_i - m_l \\|^2 $$\n",
    "> 就是说在划分确定的情况下，使样本和其所属类的中心之间的距离总和最小。求解结果，对于每个包含$n_l$个样本的类$G_l$，更新其均值$m_l$：\n",
    "> $$ m_l = \\frac{1}{n_l} \\sum_{C(i) = l} x_i, \\quad l = 1, \\cdots, k $$\n",
    "> &emsp;&emsp;重复以上两个步骤，直到划分不再改变，得到聚类结果。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：高斯混合模型参数估计的EM算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第9章的定义9.2的高斯混合模型：\n",
    "\n",
    "> **定义9.2（高斯混合模型）** 高斯混合模型是指具有如下形式的概率分布模型：\n",
    "> $$ P(y|\\theta) = \\sum_{k=1}^K \\alpha_k \\phi(y | \\theta_k) $$\n",
    "> 其中，$\\alpha_k$是系数，$\\alpha_k \\geqslant 0$，$\\displaystyle \\sum_{k=1}^K \\alpha_k = 1$；$\\phi(y | \\theta_k)$是高斯分布密度，$\\theta_k = (u_k, \\sigma_k^2)$\n",
    "> $$ \\phi(y | \\theta_k) = \\frac{1}{\\sqrt{2 \\pi} \\sigma_k} \\exp \\left( - \\frac{(y - u_k)^2}{ 2 \\sigma_k^2} \\right) $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第9.3.2节的高斯混合模型参数估计的EM算法：\n",
    "\n",
    "> 假设观测数据$y_1, y_2, \\cdots, y_N$由高斯混合模型生成，\n",
    "> $$ P(y|\\theta) = \\sum_{k=1}^K \\alpha_k \\phi(y | \\theta_k) $$\n",
    "> 其中，$\\theta=(\\alpha_1, \\alpha_2, \\cdots, \\alpha_K; \\theta_1, \\theta_2, \\cdots, \\theta_K)$。\n",
    "> 1.明确隐变量，写出完全数据的对数似然函数  \n",
    "> 可定义隐变量$\\gamma_{jk}$如下：\n",
    "> $$ \\begin{array}{l}\n",
    "\\gamma_{jk} = \\left \\{ \\begin{array}{ll} \n",
    "1, & \\text{第}j \\text{个观测来自第}k \\text{个分模型} \\\\\n",
    "0, & \\text{否则}\n",
    "\\end{array} \\right. \\\\\n",
    "j = 1,2,\\cdots, N; \\quad k = 1,2,\\cdots, K \n",
    "\\end{array} $$\n",
    "> $\\gamma_{jk}$是0-1随机变量。  \n",
    "> 完全数据的对数似然函数为\n",
    "> $$ \\log P(y, \\gamma | \\theta) = \\sum_{k=1}^K \\left \\{ n_k \\log \\alpha_k + \\sum_{j=1}^N \\gamma_{jk} \\left[ \\log \\left( \\frac{1}{\\sqrt{2 \\pi}} \\right) - \\log \\sigma_k - \\frac{1}{2 \\sigma_k^2} (y_j - u_k)^2 \\right] \\right \\} $$\n",
    "> 2.EM算法的E步：依据当前模型参数，计算分模型$k$对观测数据$y_j$的响应度\n",
    "> $$ \\hat{\\gamma}_{jk} = \\frac{\\alpha_k \\gamma(y_j | \\theta_k)}{\\displaystyle \\sum_{k=1}^K \\alpha_k \\gamma(y_j | \\theta_k)}, \\quad j=1,2,\\cdots, N; \\quad k = 1,2,\\cdots, K $$\n",
    "> 3.EM算法的M步：计算新一轮迭代的模型参数\n",
    "> $$ \\hat{u}_k = \\frac{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk} y_j}{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk}}, \\quad k = 1,2,\\cdots, K \\\\\n",
    "\\hat{\\sigma}^2 = \\frac{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk} (y_j - u_k)^2}{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk}}， \\quad k = 1,2,\\cdots, K \\\\\n",
    "\\hat{\\alpha}_k = \\frac{\\displaystyle \\sum_{j=1}^N \\hat{\\gamma}_{jk} }{N}, \\quad k = 1,2,\\cdots, K $$\n",
    "> \n",
    "> 重复以上计算，直至对数似然函数值不再有明显的变化为止。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：两者的相同点**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 两者求解的步骤相似：$k$均值聚类算法和高斯混合模型的EM算法的求解方式都是迭代求解，$k$均值聚类算法的求解中的E步是将每个样本分配到距离最近的类中心；M步是根据分配后的样本更新各类的均值\n",
    "2. 两者都需要提前指定$k$值，并且分类都受初始的$k$值影响\n",
    "3. 两者都有可能收敛到局部最优解"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第4步：两者的不同点**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 高斯混合模型的EM算法可以看作$k$均值聚类算法的软扩展(Soft extension)：$k$均值聚类算法中每一个样本属于$k$个类别中的某一个类别是确定性的，高斯混合模型的EM算法中每一个样本属于某一类是由$k$个高斯分布分模型按照某一概率分布表示的；\n",
    "2. 待估计的参数不同，换言之，即损失函数不同。$k$均值聚类算法中需要更新的参数是各类的均值$\\mu$，高斯混合模型的EM算法中需要更新的参数是先验$\\alpha$、均值$\\mu$、方差$\\sigma$。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 参考文献"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "【1】Fred S. Roberts, Barry Tesman, 冯速(译).应用组合数学（原书第2版）[M]. 2007(1):216"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "c50c7698ff25eb1d7dce4c52e3d5cc14b3b23d6a97d4fb70aad8c815eba6f63e"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "273.188px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
